AMD Flang Weno Optimizations by danieljvickers · Pull Request #1374 · MFlowCode/MFC

danieljvickers · 2026-04-16T18:19:17Z

Description

Loop unwrapping to optimize AMD Flang GPU performance for WENO.

Type of change

Bug fix

Testing

Performance checking of the examples/3D_performance_test case

AI code reviews

Reviews are not triggered automatically. To request a review, comment on the PR:

@coderabbitai review — incremental review (new changes only)
@coderabbitai full review — full review from scratch
/review — Qodo review
/improve — Qodo code suggestions
@claude full review — Claude full review (also triggers on PR open/reopen/ready)
Add label claude-full-review — Claude full review via label

… boundaries

github-actions · 2026-04-16T18:22:01Z

Claude Code Review

Head SHA: 26e982e

Files changed:

2
CMakeLists.txt
src/simulation/m_weno.fpp

Findings

1. WENO5 final omega normalization hardcodes stencil indices 0–2 instead of using a loop (lines ~1104–1106 and ~1145–1147)

The PR replaces omega = alpha/sum(alpha) in the WENO5 path with:

omega(0) = alpha(0)/(alpha(0) + alpha(1) + alpha(2))
omega(1) = alpha(1)/(alpha(0) + alpha(1) + alpha(2))
omega(2) = alpha(2)/(alpha(0) + alpha(1) + alpha(2))

This hardcodes weno_num_stencils == 2. Every other conversion in this PR consistently uses do q = 0, weno_num_stencils. A do q loop with a precomputed denominator would be both consistent and avoid recomputing the sum three times. The current form silently gives wrong or incomplete results if weno_num_stencils differs.

2. Intermediate omega = alpha/sum(alpha) inside the mapped_weno branch is not converted (WENO3 and WENO5)

The mapped_weno branch computes an initial omega before remapping alpha:

do q = 0, weno_num_stencils
    alpha(q) = d_cbL/R_${XYZ}$(q, j)/(beta(q)**2._wp)
end do
omega = alpha/sum(alpha)   ! <- still an array operation
do q = 0, weno_num_stencils
    alpha(q) = ...remapped formula using omega(q)...
end do

The surrounding alpha assignments were converted to do q loops to fix AMD flang GPU array-section issues, but the intermediate omega = alpha/sum(alpha) — which uses the same sum(array_section) construct — was left unchanged. This intermediate statement will still hit the same compiler limitation on AMD flang OpenMP GPU builds, silently producing wrong omega values when mapped_weno is active.

3. WENO3 final omega = alpha/sum(alpha) not converted

The WENO5 path's final omega = alpha/sum(alpha) is converted (finding 1 above), but the identical statement in the WENO3 path (outside the if/else block, in the weno_order == 3 section) is left as an array operation. If AMD flang OpenMP GPU builds cannot handle omega = alpha/sum(alpha) for WENO5, the same issue applies to WENO3 (weno_num_stencils == 1, two stencils). The fix appears incomplete for third-order WENO.

… into weno-array-unwrap

coderabbitai · 2026-04-20T19:18:33Z

📝 Walkthrough

Walkthrough

This pull request contains two primary modifications: First, the CMakeLists.txt expands compile options for OpenMP GPU offloading with the LLVMFlang compiler, adding -fopenmp-target-fast, -fopenmp-assume-threads-oversubscription, and -fopenmp-assume-teams-oversubscription to the existing flags while keeping link options unchanged. Second, the m_weno.fpp file refactors WENO computation logic by replacing array section assignments with explicit element-wise loops indexed by q across multiple WENO schemes (wenojs, mapped_weno, wenoz), and restructures omega normalization in higher-order WENO paths from array division to individual scalar assignments.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description is incomplete. It lacks required sections: missing 'Fixes #(issue)' reference, unchecked type-of-change boxes, incomplete testing details, and missing GPU testing checklist items.	Complete the pull request description template: reference any related issue, check appropriate type-of-change boxes, provide detailed testing steps, and fill out the GPU changes checklist with results.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: loop unwrapping optimizations for WENO code targeting AMD Flang GPU performance.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

src/simulation/m_weno.fpp (2)
1107-1109: Optional: hoist the shared denominator and consider applying the same unroll to order-3/order-7.

Two minor points on the order-5 omega normalization changes at 1107-1109 and 1148-1150:
The expression (alpha(0) + alpha(1) + alpha(2)) is recomputed three times per side. A compiler will likely CSE, but an explicit temporary makes intent clearer and is cheaper to read/verify:
-                                        omega(0) = alpha(0)/(alpha(0) + alpha(1) + alpha(2))
-                                        omega(1) = alpha(1)/(alpha(0) + alpha(1) + alpha(2))
-                                        omega(2) = alpha(2)/(alpha(0) + alpha(1) + alpha(2))
+                                        tau = alpha(0) + alpha(1) + alpha(2)   ! reuse scratch or add a local
+                                        omega(0) = alpha(0)/tau
+                                        omega(1) = alpha(1)/tau
+                                        omega(2) = alpha(2)/tau
(Use a dedicated local, not tau, if it's still live elsewhere — here it is not after this point in the left block. For the right block at 1148-1150, tau is still reusable by wenoz above, so add a fresh scalar and declare it private in the outer GPU_PARALLEL_LOOP.)
The same omega = alpha/sum(alpha) pattern still exists at lines 986, 1015 (order-3), 1304, 1366 (order-7), and inside the order-5 mapped_weno/teno branches (1070, 1094). If the AMD Flang perf regression you targeted is specifically the array sum() reduction, consider applying the unroll uniformly (or justify in a comment why only the order-5 final normalization was unrolled). As-is, the half-done refactor is a small consistency wart.
965-983: Add GPU_LOOP(parallelism='[seq]') markers to new q loops for consistency with adjacent code.

The newly introduced do q = 0, weno_num_stencils loops inside GPU_PARALLEL_LOOP kernels (order-3 at lines 965–983 and 996–1012, order-5 at lines 1063–1075 and 1123–1135, order-7 at lines 1264–1276 and 1340–1352) lack explicit $:GPU_LOOP(parallelism='[seq]') annotations, while sibling wenoz/teno q loops in the same kernels carry these markers (e.g., lines 1080, 1088, 1096, 1137, 1142, 1280, 1292, 1354, 1360). Although inner loops default to sequential execution when unmarked (confirmed across nvfortran, Cray ftn, and AMD Flang), adding explicit seq markers maintains code clarity and consistency with established patterns in this function.
♻️ Example for order-3 left block
                                     if (wenojs) then
+                                        $:GPU_LOOP(parallelism='[seq]')
                                         do q = 0, weno_num_stencils
                                             alpha(q) = d_cbL_${XYZ}$ (q, j)/(beta(q)**2._wp)
                                         end do
                                     else if (mapped_weno) then
+                                        $:GPU_LOOP(parallelism='[seq]')
                                         do q = 0, weno_num_stencils
                                             alpha(q) = d_cbL_${XYZ}$ (q, j)/(beta(q)**2._wp)
                                         end do
                                         omega = alpha/sum(alpha)
+                                        $:GPU_LOOP(parallelism='[seq]')
                                         do q = 0, weno_num_stencils
                                             alpha(q) = (d_cbL_${XYZ}$ (q, j)*(1._wp + d_cbL_${XYZ}$ (q, &
                                                   & j) - 3._wp*omega(q)) + omega(q)**2._wp)*(omega(q)/(d_cbL_${XYZ}$ (q, &
                                                   & j)**2._wp + omega(q)*(1._wp - 2._wp*d_cbL_${XYZ}$ (q, j))))
                                         end do
                                     else if (wenoz) then
                                         ! Borges, et al. (2008)
                                         tau = abs(beta(1) - beta(0))
                                         do q = 0, weno_num_stencils
                                             alpha(q) = d_cbR_${XYZ}$ (q, j)*(1._wp + tau/beta(q))
                                         end do
Apply the same treatment to the mirrored right-side blocks (lines 996–1012, 1123–1135, 1340–1352) and order-5/7 counterparts (lines 1063–1075, 1264–1276).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: da41e88b-a73a-4de0-abe0-bebeba2682cb

📥 Commits

Reviewing files that changed from the base of the PR and between 92585b1 and 26e982e.

📒 Files selected for processing (2)

CMakeLists.txt
src/simulation/m_weno.fpp

coderabbitai · 2026-04-20T19:18:36Z

                    target_link_options(${a_target} PRIVATE -fopenmp)
                elseif(CMAKE_Fortran_COMPILER_ID STREQUAL "LLVMFlang")
-                    target_compile_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a)
+                    target_compile_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a -fopenmp-target-fast -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail fc="${FC:-flang}" if ! command -v "$fc" >/dev/null 2>&1; then echo "Fortran compiler '$fc' not found; set FC to the LLVMFlang/flang binary to verify these flags." >&2 exit 2 fi "$fc" --help 2>&1 | grep -E -- \ '-fopenmp-target-fast|-fopenmp-assume-threads-oversubscription|-fopenmp-assume-teams-oversubscription|--offload-arch' \ || { echo "One or more LLVMFlang OpenMP offload flags are not advertised by this compiler." >&2 exit 1 }

Repository: MFlowCode/MFC

Length of output: 152

🏁 Script executed:

# Check if CMakeLists.txt exists and get file size wc -l CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 76

🏁 Script executed:

# Read lines 630-650 to see the context around line 639 sed -n '630,650p' CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 1484

🏁 Script executed:

# Search for GPU architecture configuration patterns in CMakeLists.txt rg -n "offload-arch|AMDGPU|AMD_GPU|GPU_ARCH" CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 352

🏁 Script executed:

# Search for LLVMFlang or flang compiler handling rg -n "LLVMFlang\|flang" CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 39

🏁 Script executed:

# Search for GPU architecture configuration patterns more broadly rg -n "GPU|ARCH|offload" CMakeLists.txt | head -30

Repository: MFlowCode/MFC

Length of output: 1453

🏁 Script executed:

# Search for MFC_* CMake options rg -n "^option\|^set.*CACHE" CMakeLists.txt | head -40

Repository: MFlowCode/MFC

Length of output: 39

🏁 Script executed:

# Get more context around the LLVMFlang block - lines 625-660 sed -n '625,660p' CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 2309

🏁 Script executed:

# Check if there are any other references to gfx90a or MI250X in the repo rg -n "gfx90a|MI250X|AMD" --type-add 'cmake:*.cmake' -t cmake

Repository: MFlowCode/MFC

Length of output: 559

🏁 Script executed:

# Search for any existing CMake configuration or detection of AMD GPU architecture rg -n "AMDGPU\|amdgpu\|ROCM\|rocm\|HIP" CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 39

🏁 Script executed:

# Check git history to see if these lines were modified in this PR git log -p --follow -S "offload-arch=gfx90a" -- CMakeLists.txt | head -100

Repository: MFlowCode/MFC

Length of output: 39

🏁 Script executed:

# Get the git diff to see what actually changed in this PR git diff HEAD~1 CMakeLists.txt | grep -A 5 -B 5 "LLVMFlang" || echo "No diff found or file unchanged"

Repository: MFlowCode/MFC

Length of output: 292

🏁 Script executed:

# Search for any mention of build type (Release/Debug) constraints on GPU flags rg -n "CMAKE_BUILD_TYPE\|Release\|Debug" CMakeLists.txt | head -20

Repository: MFlowCode/MFC

Length of output: 39

Gate the LLVMFlang fast-offload assumptions and GPU arch.

Lines 639–640 hardcode --offload-arch=gfx90a for both compile and link, and line 639 unconditionally enables -fopenmp-target-fast -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription for all LLVMFlang builds, including Debug and RelDebug. ROCm documents -fopenmp-target-fast as requiring its constraints to be satisfied and implying extra assumptions (e.g., ignoring target env vars, applying -O3 when no -O* flag is set), which can alter debugging and performance-tuning behavior. Additionally, hardcoding --offload-arch=gfx90a limits this path to MI250X-class GPUs; the codebase's own FIXME comment at line 645 (in the GNU section) acknowledges this limitation.

Make the GPU architecture configurable via CMake and gate the fast assumptions behind Release build type and/or an explicit option.

Suggested direction

+option(MFC_AMD_OPENMP_FAST "Enable LLVMFlang AMD OpenMP fast target assumptions" ON) +set(MFC_AMDGPU_ARCH "native" CACHE STRING "AMD GPU architecture for LLVMFlang OpenMP offload") elseif(CMAKE_Fortran_COMPILER_ID STREQUAL "LLVMFlang") - target_compile_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a -fopenmp-target-fast -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription) - target_link_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a) + target_compile_options(${a_target} PRIVATE -fopenmp "--offload-arch=${MFC_AMDGPU_ARCH}") + target_link_options(${a_target} PRIVATE -fopenmp "--offload-arch=${MFC_AMDGPU_ARCH}") + + if (MFC_AMD_OPENMP_FAST AND CMAKE_BUILD_TYPE STREQUAL "Release") + target_compile_options(${a_target} PRIVATE + -fopenmp-target-fast + -fopenmp-assume-threads-oversubscription + -fopenmp-assume-teams-oversubscription) + endif() endif()

codecov · 2026-04-20T21:20:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.66%. Comparing base (92585b1) to head (26e982e).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1374      +/-   ##
==========================================
+ Coverage   64.62%   64.66%   +0.04%     
==========================================
  Files          71       71              
  Lines       18407    18432      +25     
  Branches     1516     1517       +1     
==========================================
+ Hits        11895    11919      +24     
  Misses       5555     5555              
- Partials      957      958       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Co-authored-by: Daniel Vickers <danieljvickers@login09.frontier.olcf.ornl.gov> Co-authored-by: Daniel Vickers <danieljvickers@frontier00007.frontier.olcf.ornl.gov> Co-authored-by: Daniel Vickers <danieljvickers@login12.frontier.olcf.ornl.gov> Co-authored-by: wilfonba <bwilfong3@gatech.edu> Co-authored-by: Daniel Vickers <danieljvickers@frontier01665.frontier.olcf.ornl.gov>

danieljvickers and others added 10 commits March 15, 2026 07:51

Remove interior point conservative variable protection for stationary…

4273c84

… boundaries

Merge branch 'master' of github.com:danieljvickers/MFC

4b2f519

Merge branch 'MFlowCode:master' into master

ba0cacb

Merge branch 'MFlowCode:master' into master

e6988ba

Merge branch 'MFlowCode:master' into master

0ce6f11

Merge branch 'MFlowCode:master' into master

16bd456

Merge branch 'MFlowCode:master' into master

f0e117d

unwrapped more alpha assigments in MFC

2c9b0fc

unwrapped more

4cccc57

AMD speedup achieved

2cf56df

Daniel Vickers and others added 9 commits April 16, 2026 14:30

Remove hanging statement

03979ef

Loops work

d7693a9

Merge branch 'master' into weno-array-unwrap

aa16367

matched array unwrapping with the rest of the weno conditions

f49f69f

Merge branch 'master' into weno-array-unwrap

02fb497

Added to other wenos

ea07618

bug fix for WENOZ3

7e26f9a

Added case optimization flags for AMD Flang

219ff34

Merge branch 'weno-array-unwrap' of github.com:danieljvickers/MFC-Dan…

26e982e

… into weno-array-unwrap

danieljvickers marked this pull request as ready for review April 20, 2026 19:12

coderabbitai Bot reviewed Apr 20, 2026

View reviewed changes

sbryngelson approved these changes Apr 20, 2026

View reviewed changes

sbryngelson merged commit ee8ad84 into MFlowCode:master Apr 21, 2026
161 of 198 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AMD Flang Weno Optimizations#1374

AMD Flang Weno Optimizations#1374
sbryngelson merged 19 commits intoMFlowCode:masterfrom
danieljvickers:weno-array-unwrap

danieljvickers commented Apr 16, 2026

Uh oh!

github-actions Bot commented Apr 16, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Apr 20, 2026

Walkthrough

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 20, 2026

Uh oh!

codecov Bot commented Apr 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

Conversation

danieljvickers commented Apr 16, 2026

Description

Type of change

Testing

AI code reviews

Uh oh!

github-actions Bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review

Findings

Uh oh!

coderabbitai Bot commented Apr 20, 2026

Walkthrough

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented Apr 16, 2026 •

edited

Loading

codecov Bot commented Apr 20, 2026 •

edited

Loading